# Multi-label Prediction
Code for Multi-label Prediction for Political Text-as-Data

## Abstract
Political scientists increasingly use supervised machine learning to code multiple relevant labels from a single set of texts. The current "best practice" of individually applying supervised machine learning to each label ignores information on inter-label association(s), and is likely to under-perform as a result. We introduce multi-label prediction as a solution to this problem. After reviewing the multi-label prediction framework, we apply it to code multiple features of (i) access-to-information requests made to the Mexican government and (ii) country-year human rights reports. We find that multi-label prediction outperforms standard supervised learning approaches, even in instances where the correlations among one's multiple labels are low.


## Repository structure

| File/Folder | Description |
| ------ | ------ |
| `./data_clean/` | Contains processed data |
| `./data_raw/` | Contains raw data (before preprocessing) |
| `./plots/` | Contains plots of the results |
| `./results/` | Contains results as CSV files |
| `./scripts/` | Contains scripts to preprocess, optimize and run classification models |



## Setup
These analyses were run using Python 3.7.4. All of the required libraries are listed in the `requirements.txt` file. To install them locally, run ``` pip install -r requirements.txt ``` on the terminal.

## Replication information

### In order to replicate *all* results in the paper, please follow this step:
- Inside the `scripts` folder run ``` python replicate_results.py ``` on the terminal. It can take up to 6 hours depending on your machine specifications.



For replicating individual results, please see below.

### In order to replicate the Access to Information (ATI) results (Table 2, Table E1, Table E4) please follow these steps:

1) Inside the `scripts` folder, run ``` python run_models_ati.py 'all' ``` on the terminal. It can take up to 6 hours depending on your machine specifications.
2) Run ``` python gen_tables_e1_e4.py ```. The results are stored in the `results/ATI` folder. This script also generates a version of Table 2, stored in `results/ATI/table_2.csv`, which uses binary indicators in place of the checkmarks shown in the LaTeX version. Please note that Tables E1 and E4 are each split into two files, one for the average results and one for the standard deviations.
More specifically, the results to replicate Table E1 are `results/ATI/table_e1_mean.csv` and `results/ATI/table_e1_std.csv`.
Similarly, the results to replicate Table E4 are `results/ATI/table_e4_mean.csv` and `results/ATI/table_e4_std.csv`.
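
Because each of Tables E1 and E4 is split into a mean file and a standard deviation file, it can be convenient to merge the pair into a single table for reading. The following is a minimal sketch, not part of the repository: it assumes that the two files have identical shape and layout, and the helper names (`read_table`, `combine_mean_std`) are hypothetical.

```python
import csv


def read_table(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def is_number(text):
    """Return True if the cell can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def combine_mean_std(mean_rows, std_rows, digits=3):
    """Merge two same-shaped tables into 'mean (std)' cells.

    Cells that are not numeric in both tables (headers, row labels)
    are copied unchanged from the mean table.
    """
    if len(mean_rows) != len(std_rows):
        raise ValueError("mean and std tables have different numbers of rows")
    combined = []
    for mean_row, std_row in zip(mean_rows, std_rows):
        if len(mean_row) != len(std_row):
            raise ValueError("mean and std rows have different lengths")
        new_row = []
        for mean_cell, std_cell in zip(mean_row, std_row):
            if is_number(mean_cell) and is_number(std_cell):
                new_row.append(
                    f"{float(mean_cell):.{digits}f} ({float(std_cell):.{digits}f})"
                )
            else:
                new_row.append(mean_cell)
        combined.append(new_row)
    return combined


# Example usage, run from the repository root after gen_tables_e1_e4.py:
# rows = combine_mean_std(
#     read_table("results/ATI/table_e1_mean.csv"),
#     read_table("results/ATI/table_e1_std.csv"),
# )
```

Note that any purely numeric header or index cell would also be formatted as a `mean (std)` pair, so the output should be checked against the original files.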


You can also run each model individually by replacing 'classifier' in ``` python run_models_ati.py 'classifier' ``` with one of the values below.

| 'classifier' | Description |
| ------ | ------ |
| `all` | Runs all classifiers |
| `optimized_thresholds` | Runs binary relevance classifiers with optimized threshold values (for more details please see `scripts/optimized_thresholds.py`) |
| `optimized_models` | Runs binary relevance classifiers in which each label has its own optimized classifier (for more details please see `scripts/optimized_thresholds.py`) |
| `SMOTE` | Runs binary relevance classifiers using SMOTE for oversampling |
| `rakel_4` | Runs RAkEL classifier with k = 4 (label set divided into 4) |
| `rakel_2` | Runs RAkEL classifier with k = 2 (label set divided into 2) |
| `LP` | Runs Label powerset |
| `CC` | Runs classifier chain |
| `ECC` | Runs ensemble classifier chain |
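
To run only a subset of the classifiers in the table above, the individual calls can be scripted. The sketch below is an illustration and not part of the repository: the helper names (`build_command`, `run_classifiers`) are hypothetical, and it assumes that `run_models_ati.py` takes the classifier name as its only positional argument, as in the command shown above.

```python
import subprocess
import sys

# Classifier names copied from the table above ('all' is handled separately).
CLASSIFIERS = [
    "optimized_thresholds",
    "optimized_models",
    "SMOTE",
    "rakel_4",
    "rakel_2",
    "LP",
    "CC",
    "ECC",
]


def build_command(classifier):
    """Return the argument list for one run of run_models_ati.py."""
    if classifier != "all" and classifier not in CLASSIFIERS:
        raise ValueError(f"unknown classifier: {classifier!r}")
    # sys.executable reuses the interpreter that is running this script.
    return [sys.executable, "run_models_ati.py", classifier]


def run_classifiers(names, scripts_dir="scripts"):
    """Run the selected classifiers in order, stopping at the first failure."""
    for name in names:
        subprocess.run(build_command(name), cwd=scripts_dir, check=True)
```

For example, calling `run_classifiers(["CC", "ECC"])` from the repository root would run the classifier chain and the ensemble classifier chain one after the other.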

### In order to replicate the Human Rights results (Figure 2) please follow these steps:

1) Inside the `scripts` folder, execute on the terminal: ``` python mc_simulations.py ```.
2) Figure 2 can then be generated by running ``` python plot_hr.py ```. The resulting figure (`figure_2.png`) is stored in the `plots` folder.


### In order to replicate the Monte Carlo results (Figures g1, g2, g3 and g4; Tables g1, g2, g3 and g4) please follow these steps:

1) Inside the `scripts` folder, execute on the terminal: ``` python mc_simulations.py ```.
2) Run ``` python generate_results_MC.py ```. This script generates both the tables and the figures.
  - The files to replicate Table g1 are stored at `results/MC Classification/table_g1_part1.csv` and `results/MC Classification/table_g1_part2.csv`. Part 1 contains the binary relevance results and part 2 contains the ECC results.
  - The files to replicate Table g2, Table g3 and Table g4 are stored in `results/MC Regression` with the names `table_g2.csv`, `table_g3.csv` and `table_g4.csv`, respectively.
  - The figures (`figure_g1.png`, `figure_g2.png`, `figure_g3.png`, `figure_g4.png`) are stored inside the `plots` folder.
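
After the Monte Carlo scripts finish, a quick check that every expected file was written can save a rerun. The following is a minimal sketch under the assumption that the output paths are exactly those listed above, relative to the repository root; the helper `missing_outputs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Output paths as listed in the steps above, relative to the repository root.
EXPECTED_OUTPUTS = [
    "results/MC Classification/table_g1_part1.csv",
    "results/MC Classification/table_g1_part2.csv",
    "results/MC Regression/table_g2.csv",
    "results/MC Regression/table_g3.csv",
    "results/MC Regression/table_g4.csv",
    "plots/figure_g1.png",
    "plots/figure_g2.png",
    "plots/figure_g3.png",
    "plots/figure_g4.png",
]


def missing_outputs(repo_root="."):
    """Return the expected output paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list from `missing_outputs()` means that all nine files are present. The check does not validate their contents.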

### In order to replicate all results please run:
1) Inside the `scripts` folder, execute on the terminal: ``` python full_replication.py ```. Please see below for runtime estimates.


### Runtime results
1) To replicate the runtime results, please run the code from ``` colab_runtime.ipynb ``` on Google Colab. More information can be found inside the notebook. NOTE: the results will vary by machine, so the runtimes reported in the paper are only indicative.
2) Running all scripts on a Google Colab instance takes around 6 hours.

## Virtual machine specs
- GPU: Tesla T4
- CPU: 2-core Xeon 2.2GHz
- RAM: 13 GB
- OS: Linux Ubuntu 18.04


